Julio C. Gonzalez

May 8, 2018


Hi there, the Stats Whisperer here, back with a whole new topic that is a hot commodity these days: sentiment analysis. Believe it or not, social media can have a powerful impact on a business. Can you imagine a company losing $1.3 billion (not a typo, as in 1,300,000,000 bucks) over a single tweet? That is reportedly what happened to Snap Inc. when a Kylie Jenner tweet about Snapchat allegedly dropped the company's stock value by 7.5%.

It is safe to say that the amount of data generated by social networks is enormous. What if we could use that same data to find insightful information about consumer perspectives on just about anything? Well, that's exactly what we are going to do.


NLP for Movie Reviews

Ever had a heated discussion with your friends about a particular movie or actor? While you have a strong distaste for anything Nicolas Cage, your “amigo” is in an infatuated trance with all of his movies. So what do you do when the next National Treasure movie comes out? Naturally, you go, faute-de-mieux, online to investigate the reviews for the movie. I won’t get into the particulars of why those movie reviews are flawed (here is a very detailed article expanding on this issue from my favorite statistically inclined website, FiveThirtyEight), but one salient issue is that these reviews are one-dimensional. All they do is give a score from 1 to 10, or 1 to 5 stars, for a particular movie; they don’t tell you why that score was given. Sure, some people add their opinion on the movie, but who has time to read all of them? Furthermore, these movie rating sites are highly correlated with each other, and here we have some data to confirm that.

library(readr)          # read_csv()
library(highcharter)    # hchart() interactive plots

fandango <- read_csv("~/Documents/website/Sentiment-Analysis/fandango_score_comparison.csv")
t1 <- fandango[, 2:5]     # Keep only the rating columns.
# corrplot(cor(t1), method="circle")     # Static alternative.
hchart(cor(t1)) %>% hc_add_theme(hc_theme_darkunica())     # Interactive correlation plot.

In the above diagram, we did a simple correlation plot between the movie ratings given by the major movie review sites and found all of them to be highly correlated with each other. For those non-stats heads out there, a correlation plot deploys a simple statistic known as the Pearson correlation coefficient, which measures how correlated two variables are with each other. There are several ways this formula can be written, but here is the one I think makes the most intuitive sense. \[r = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i = 1}^{n}(y_i - \bar{y})^2}}\] The scale falls between -1 and 1 (inclusive), where a value of -1 means the two variables move in perfectly opposite directions, 1 means they move perfectly together, and 0 means there is no linear relationship between them.
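To see the formula in action, here is a quick sketch (using two made-up vectors, purely for illustration) that computes r by hand and matches R's built-in cor():

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Numerator: sum of co-deviations; denominator: product of the two spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_manual     # 0.7745967 — identical to cor(x, y)
```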

I propose a solution: why don’t we use the massive amount of user-generated data on social networks to try to understand the reasons a user feels a certain way about a movie? Instead of the small sample of users on movie review sites, let’s include a larger, more diverse, and more representative sample of the average paying customers who drive a film’s success. In addition, we might be able to extract which elements of a movie make a good review more likely. I don’t mean to get all Freudian, but it could give us a window into the subconscious feelings towards a particular movie.

To do that, we will use natural language processing (NLP). The following diagram gives a rough explanation of what NLP encompasses and how it is divided. There are other areas where NLP has significant applications, but for the most part it falls under these two categories.

Natural Language Processing Diagram


Our focus will be sentiment analysis, but we will also use NLP to clean up the syntax of the data via stopword removal.

Paradoxically, the way a horror movie earns a high rating is by scaring you out of your socks.


Deadpool

I will confess that I am a big superhero fan and will watch anything superhero related.

In order to conduct this analysis, the most important thing we will need is the data. The great thing about Twitter is that it has an API (application programming interface): instead of writing a script to scrape tweets from the webpage, we have direct access to the database where all of the tweet data lives. On top of that, there is an R package (twitteR) that wraps all of the code in functions that do the legwork for you. All you have to do is pass the parameters for what you want to search. Let’s see how it works.

## Pulling Twitter Data 
deadpool1 <- searchTwitter("Deadpool", n=34000, lang="en", since = "2018-05-16", until = "2018-05-17")     # Pull Twitter feed for Deadpool.
deadpool1<- twListToDF(deadpool1)     # Convert into a data frame for analysis.

In the above code, we pulled 34,000 tweets containing the word “Deadpool” right up until the day before the premiere (the movie debuted on May 18, 2018 in the US). Our goal is to analyze the sentiment before the movie was released, use it as a reference point, and compare it to the sentiment after the movie premiered to see what we can find.
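One caveat before you run this yourself: searchTwitter() comes from the twitteR package and needs OAuth credentials from a Twitter developer app. A minimal setup sketch (the credential values below are placeholders, not real keys):

```r
library(twitteR)

# Authenticate once per session with your own app's credentials.
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")
```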

Let’s take a closer look at the data.

head(deadpool1)
##                                                                                                                                           text
## 1                                                                                                                           DEADPOOL 2. \u2694️
## 2                         RT @BlakeNorthcott: We're all living in 2018. Deadpool 2's marketing team is living in 3018. https://t.co/jL4cEIduZN
## 3                         RT @BlakeNorthcott: We're all living in 2018. Deadpool 2's marketing team is living in 3018. https://t.co/jL4cEIduZN
## 4                                        RT @ManInTheHoody: Ronan Farrow is Clark Kent, Robert Mueller is Batman.\n\nAnd Avenatti is Deadpool.
## 5                                                          I enjoyed Deadpool 2… \n\nbut by god the diabetes jokes get old real fucking quick.
## 6 RT @rainbowslinky: guys look at the movie rack at walmart! brett just sent me the photo. the promo for deadpool is unbelievable https://t.c…
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             0      <NA> 2018-05-17 02:38:11     FALSE
## 2     FALSE             0      <NA> 2018-05-17 02:38:09     FALSE
## 3     FALSE             0      <NA> 2018-05-17 02:38:08     FALSE
## 4     FALSE             0      <NA> 2018-05-17 02:38:06     FALSE
## 5     FALSE             1      <NA> 2018-05-17 02:38:04     FALSE
## 6     FALSE             0      <NA> 2018-05-17 02:38:03     FALSE
##   replyToSID                 id replyToUID
## 1       <NA> 996942929460543489       <NA>
## 2       <NA> 996942923785617408       <NA>
## 3       <NA> 996942915673972737       <NA>
## 4       <NA> 996942908656947200       <NA>
## 5       <NA> 996942901396439040       <NA>
## 6       <NA> 996942894849314817       <NA>
##                                                                              statusSource
## 1      <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2      <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3      <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4      <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="https://tapbots.com/software/tweetbot/mac" rel="nofollow">Tweetbot for Mac</a>
## 6      <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##    screenName retweetCount isRetweet retweeted longitude latitude
## 1 ExequielEra            0     FALSE     FALSE      <NA>     <NA>
## 2    kodemily         2283      TRUE     FALSE      <NA>     <NA>
## 3 softnbitter         2283      TRUE     FALSE      <NA>     <NA>
## 4    rnelson0         1095      TRUE     FALSE      <NA>     <NA>
## 5    yakmoose            0     FALSE     FALSE      <NA>     <NA>
## 6   S0UL4SALE         1100      TRUE     FALSE      <NA>     <NA>

Here we see a snippet of the first 6 rows of the data. Each row starts with the text of the tweet, followed by all sorts of information: when it was created, the device used to tweet, the screen name of the user, etc. We are primarily interested in the text itself, but this shows there is potential for even deeper analysis using Twitter data.

summary(deadpool1)
##      text           favorited       favoriteCount       replyToSN        
##  Length:34000       Mode :logical   Min.   :   0.000   Length:34000      
##  Class :character   FALSE:34000     1st Qu.:   0.000   Class :character  
##  Mode  :character                   Median :   0.000   Mode  :character  
##                                     Mean   :   0.803                     
##                                     3rd Qu.:   0.000                     
##                                     Max.   :5315.000                     
##     created                    truncated        replyToSID       
##  Min.   :2018-05-16 22:17:19   Mode :logical   Length:34000      
##  1st Qu.:2018-05-17 00:29:42   FALSE:32426     Class :character  
##  Median :2018-05-17 12:43:54   TRUE :1574      Mode  :character  
##  Mean   :2018-05-17 12:08:23                                     
##  3rd Qu.:2018-05-17 23:48:36                                     
##  Max.   :2018-05-18 00:47:43                                     
##       id             replyToUID        statusSource      
##  Length:34000       Length:34000       Length:34000      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   screenName         retweetCount    isRetweet       retweeted      
##  Length:34000       Min.   :     0   Mode :logical   Mode :logical  
##  Class :character   1st Qu.:     0   FALSE:10117     FALSE:34000    
##  Mode  :character   Median :   763   TRUE :23883                    
##                     Mean   : 15616                                  
##                     3rd Qu.: 11669                                  
##                     Max.   :146426                                  
##   longitude           latitude        
##  Length:34000       Length:34000      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Here are some summary statistics as well.

By giving the data a quick glance, we see that our data captures a particular tweet that went viral leading up to the movie premiere.

Here is that tweet. Our data captures some of the retweets that made this tweet go viral.
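If you want to surface it yourself, here is a quick sketch using the data frame we already built (this assumes the viral tweet is the one with the highest retweet count in our sample):

```r
# Pull the tweet text with the highest retweet count in our sample.
deadpool1$text[which.max(deadpool1$retweetCount)]
```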



Regular Expression

Every now and then, I like to include special topics (like regular expressions) that don’t strictly fall under the subject at hand (NLP in this case) per se, but have some value when used in conjunction with it.

Going back to our data: first, we will create a vector containing all of the text from the tweets.

We will see that the tweets are “dirty,” meaning they contain useless information that is not needed for sentiment analysis. To clean them up, we will use something called regular expressions. You can think of a regular expression as a very powerful string-search language that can be used to replace individual characters or entire sections of text.

For example, here is the text of the aforementioned viral tweet.

twtxt <- deadpool1$text     # Collect text for sentiment analysis.
twtxt[2] 
## [1] "RT @BlakeNorthcott: We're all living in 2018. Deadpool 2's marketing team is living in 3018. https://t.co/jL4cEIduZN"

We can see it contains a link, a screen name and special characters which are of no use to us. However in order to remove them, we must first identify them using regular expressions.

twtxt <- gsub("https?://t\\.co/[A-Za-z0-9]+", "", twtxt)     # Clean up tweets: remove t.co links (http or https).
twtxt <- gsub("RT", "", twtxt)                               # Remove "RT" indicating a retweet.
twtxt <- gsub("@.*:", "", twtxt)                             # Remove screen names from retweets.
twtxt[2] #print clean text
## [1] "  We're all living in 2018. Deadpool 2's marketing team is living in 3018. "

After removing all the useless stuff, we now see clean text that we can use to extract the sentiment of the tweet.


Syntax NLP

As we get closer to sentiment analysis, let’s take a step back. Even though our tweets are “clean,” they still contain irrelevant information. To get to the core sentiment, we must first remove stopwords: commonly used words that do not contribute to the significance of the overall text, such as “the,” “to,” “and,” and “that.” In addition, we remove numbers and punctuation as well. The following code does this:

Encoding(twtxt) <- "latin1"                     
twtxt <- iconv(twtxt, from="latin1", to="ASCII", sub="")        # Convert to ASCII, dropping emoji and other non-ASCII characters.

library(tm)     # Corpus() and the tm_map() transformations below.
mycorpus <- Corpus(VectorSource(twtxt))                 # Create corpus.
mycorpus <- tm_map(mycorpus, content_transformer(tolower))  # Create corpus of text while making all characters lowercase.
mycorpus <- tm_map(mycorpus, stripWhitespace)           # Remove white space.
mycorpus <- tm_map(mycorpus, removePunctuation)         # Remove punctuation.
mycorpus <- tm_map(mycorpus, removeNumbers)         # Remove numbers.
mycorpus <- tm_map(mycorpus, removeWords, stopwords())  # Remove stop words
mycorpus <- tm_map(mycorpus, removeWords, c("deadpool")) # Remove other words.
mycorpus[[2]]$content     # View the content of tweet.
## [1] "   living    s marketing team  living   "

Now we can see how the original tweet has been reduced to a few simple strings. With this data in hand, we can create a word cloud.
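A minimal sketch of how that word cloud can be built, assuming the wordcloud package is installed (we count term frequencies with tm's TermDocumentMatrix first):

```r
library(wordcloud)

# Tally how often each term appears across the cleaned corpus.
tdm  <- TermDocumentMatrix(mycorpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the 100 most frequent words, largest first.
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
```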

As expected, we see the word cloud is dominated by the words found in the viral tweet.


Semantic NLP

Now we get to the meat and potatoes of NLP and apply sentiment analysis to our text corpus to try to find some useful insight in these tweets. Before we do that, let’s look at what sentiment analysis is and what it actually does. There are many types of sentiment analysis, so we’ll start with a simple one: a polarity score. This compares the words in each tweet against a dictionary of polarized words. A positive score indicates positive sentiment, while a negative score indicates negative sentiment. The score will depend on the dictionary you are using. The sentiment() function (from the sentimentr package) defaults to a combined and augmented version of the Jockers (2017) dictionary [originally exported by the syuzhet package] and Rinker’s augmented Hu & Liu (2004) dictionary in the lexicon package; however, this may not be appropriate, for example, in the context of children in a classroom, and the user is encouraged to provide or augment the dictionary (see the as_key function). For instance, the word “sick” in a high school setting may mean that something is good, whereas “sick” used by a typical adult indicates that something is not right, a negative connotation (deixis).

# Perform the sentiment analysis on the tweets.
final <- data.frame(text=sapply(mycorpus, identity), stringsAsFactors=F)     # Corpus back to a data frame.
sent_combo <- sentiment(final$text)     # Pull sentiment scores.
head(sent_combo)
##    element_id sentence_id word_count sentiment
## 1:          1           1         NA 0.0000000
## 2:          2           1          5 0.0000000
## 3:          3           1          5 0.0000000
## 4:          4           1          8 0.0000000
## 5:          5           1          9 0.5333333
## 6:          6           1          1 0.0000000

Here every row represents the output of the sentiment analysis across all 34,000 tweets, where the “sentiment” column displays the score given to that particular tweet.

mean(sent_combo$sentiment)     # average sentiment
## [1] 0.05626166

Finally, we find that the average sentiment falls at 0.0562, which is just slightly positive.

Here we see it visually.
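A sketch of how that interactive view can be produced with highcharter (the same package used for the correlation plot earlier); passing a numeric vector to hchart() yields a histogram:

```r
# Histogram of per-tweet sentiment scores, dark theme to match the earlier plot.
hchart(sent_combo$sentiment, type = "histogram") %>%
  hc_add_theme(hc_theme_darkunica())
```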

Here we see the sentiment score plotted across all of our tweets. We observe a few extreme scores here and there, but for the most part most tweets are neutral. Feel free to zoom in on the histogram by selecting across several columns.

As explained previously, you will get different results depending on the sentiment dictionary you use. Here, we will use the syuzhet dictionary.

sent_syuzhet <- get_sentiment(final$text, method="syuzhet")     # Other methods: "bing", "afinn", "nrc", "stanford".
mean(sent_syuzhet)     # Average sentiment.
## [1] 0.2009029

Using the syuzhet dictionary, we now get an average sentiment score of about 0.2, which is still only mildly positive and not terribly different from the previous result.

Generally, we expect the balance of positive and negative sentiment to be correlated with the movie’s reviews, which is reassuring but not very insightful. We can actually go much deeper by acquiring sentiment scores for more particular kinds of emotions. We can achieve that with the NRC dictionary, where each tweet is scored on 8 additional categories: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
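The per-emotion table below comes from syuzhet's get_nrc_sentiment(), which scores each text against the NRC emotion lexicon; a sketch of that call:

```r
library(syuzhet)

# One column per NRC emotion (plus negative/positive), one row per tweet.
sent_nrc <- get_nrc_sentiment(final$text)
head(sent_nrc)
```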

##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     0            0       0    0   0       0        0     0        0
## 2     0            0       0    0   0       0        0     1        0
## 3     0            0       0    0   0       0        0     1        0
## 4     0            0       0    0   0       0        0     0        0
## 5     0            1       0    1   1       0        0     2        0
## 6     0            0       0    0   0       0        0     0        0
##   positive
## 1        0
## 2        0
## 3        0
## 4        0
## 5        2
## 6        0

Part 2

Having inspected the sentiment prior to Deadpool 2’s premiere, we would now like to see the sentiment as more and more people see the movie.

First we will pull tweets on the day it premiered and the day after.

## Pulling Twitter Data 
deadpool2 <- searchTwitter("Deadpool", n=34000, lang="en", since = "2018-05-18", until = "2018-05-19")     # Pull Twitter feed for Deadpool after premiere
deadpool2<- twListToDF(deadpool2)     # Convert into a data frame for analysis.
twtxt2 <- deadpool2$text 
twtxt2 <- gsub("https?://t\\.co/[A-Za-z0-9]+", "", twtxt2)     # Clean up tweets: remove t.co links (http or https).
twtxt2 <- gsub("RT", "", twtxt2)                               # Remove "RT" indicating a retweet.
twtxt2 <- gsub("@.*:", "", twtxt2)                             # Remove screen names from retweets.

Encoding(twtxt2) <- "latin1"                    
twtxt2 <- iconv(twtxt2, from="latin1", to="ASCII", sub="")      # Convert to ASCII format.

mycorpus2 <- Corpus(VectorSource(twtxt2))           # Create corpus.
mycorpus2 <- tm_map(mycorpus2, content_transformer(tolower))    # Create corpus of text.
mycorpus2 <- tm_map(mycorpus2, stripWhitespace)     # Remove white space.
mycorpus2 <- tm_map(mycorpus2, removePunctuation)       # Remove punctuation.
mycorpus2 <- tm_map(mycorpus2, removeNumbers)           # Remove numbers.
mycorpus2 <- tm_map(mycorpus2, removeWords, stopwords()) #Remove stop words.
mycorpus2 <- tm_map(mycorpus2, removeWords, c("deadpool")) # Remove other words.

final2 <- data.frame(text=sapply(mycorpus2, identity), stringsAsFactors=F)     # Corpus back to a data frame.
sent_combo2 <- sentiment(final2$text)     # Pull sentiment with the default combo dictionary.
mean(sent_combo2$sentiment)    # Average sentiment.
## [1] 0.1102167

We can see the sentiment has actually gone up.

sent_nrc2 <- get_nrc_sentiment(final2$text)     # Get more detailed sentiment.

Comparisons

Dictionary   Avg. Sentiment Before Premiere   Avg. Sentiment After Premiere   Difference   Percent Change
combo                   0.0562617                        0.1102167             0.0539550       95.90%
syuzhet                 0.2009029                        0.3105250             0.1096221       54.56%
bing                    0.1982353                        0.3106471             0.1124118       56.71%
afinn                   0.7402353                        1.7267353             0.9865000      133.27%
nrc                     0.0705294                        0.0388235            -0.0317059      -44.95%

Note: The combo dictionary is a combined and augmented version of the Jockers (2017) and Rinker’s augmented Hu & Liu (2004) dictionaries.

As we can observe, 4 out of the 5 dictionaries showed an increase in positivity of more than 50%, meaning the tweets generated once people saw the movie contained more positive language than the ones from before the premiere.

It looks like people are actually liking the movie, right? Well, not exactly. Let’s compare the more detailed emotion scores.

Sentiment      Total Score Before Premiere   Total Score After Premiere   Difference   Percent Change
anger                      1701                          2393                  692         40.68%
anticipation               6776                          6464                 -312         -4.60%
disgust                    1823                          2024                  201         11.03%
fear                       2781                          3245                  464         16.68%
joy                        4368                          5275                  907         20.76%
sadness                    2423                          2390                  -33         -1.36%
surprise                   6615                         16739                10124        153.05%
trust                     12607                         10764                -1843        -14.62%

While on one hand we observe a large increase in the total scores for surprise and joy, paired with a decrease in sadness, we simultaneously observe increases of about 40%, 16%, and 11% in anger, fear, and disgust, respectively.

Even though we saw an increase in average positive score, we simultaneously observe an increase in negative sentiments. How can that be? It might be that the increase in surprise scores overshadows all the smaller negative scores. Or, paradoxically, maybe it’s the feelings of anger and fear that are making people feel good about the movie. It’s weird, but similar to how, for a scary movie to be good, you need to be scared out of your socks. That’s a little tricky, because fear is considered a negative emotion, yet the success of a horror film depends on you actually feeling scared.


Conclusion

Using NLP with social media data is a different type of approach to movie reviews, one that arguably provides a closer and more organic assessment of public perception towards a particular film.

Since social networks perpetually generate data, they provide a more real-time stream of information about consumers’ preferences, akin to how a company’s stock value is a running measure of its performance. Instead of the traditional approach, where one sees a movie and then writes a review a single time, we have a constant feed of reactions.

Thinking on a much larger scale, we can expand this to include multiple social networks with even more niche users, like Reddit, Quora, and even YouTube comments, and do comparisons across networks. Furthermore, since the production of a movie incorporates people who are continuously in the public eye, we can observe the effect of an important event in an actor’s life outside of film, sort of like how music sales shoot through the roof in the aftermath of an artist’s death.

To bring this home: we now have unprecedented access to a multifaceted view of the film industry. In truth, while we have more data available than ever, extracting the signal rather than the noise from it is, even today, an incredibly difficult task. Pile privacy concerns on top of that, and it can really get messy. Sure, movie production companies with deep pockets can use this information to improve their bottom line, but the question is: should they have access to this type of intimate information? Who knew that a simple tweet provides so much information about you?

In the words of Peter Parker, in remembrance of his uncle Ben at the end of the Spider-Man movie,

“With great power comes great responsibility.”

Thanks for reading.